Evaluation Metrics for Unsupervised Learning Algorithms
Determining the quality of the results obtained by clustering techniques is a
key issue in unsupervised machine learning. Many authors have discussed the
desirable features of good clustering algorithms. However, Jon Kleinberg
established an impossibility theorem for clustering. As a consequence, a wealth
of studies have proposed techniques to evaluate the quality of clustering
results depending on the characteristics of the clustering problem and the
algorithmic technique employed to cluster data.
Comment: Technical Report
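One family of the evaluation techniques this survey covers is internal quality indices, which score a clustering without ground-truth labels. As a minimal illustration (not a method from the survey itself), the silhouette coefficient can be sketched as follows:

```python
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def silhouette(points, labels):
    """Mean silhouette over all points: (b - a) / max(a, b), where a is a
    point's mean intra-cluster distance and b its mean distance to the
    nearest other cluster."""
    members = {}
    for i, lab in enumerate(labels):
        members.setdefault(lab, []).append(i)
    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in members[lab] if j != i]
        if not own:                     # singleton cluster: score 0 by convention
            scores.append(0.0)
            continue
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        b = min(sum(euclidean(points[i], points[j]) for j in js) / len(js)
                for k, js in members.items() if k != lab)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = silhouette(points, [0, 0, 1, 1])   # labels match the two blobs
bad = silhouette(points, [0, 1, 0, 1])    # labels mix the blobs
```

A clustering that matches the two spatial blobs scores near 1, while one that mixes them scores negatively, which is the behavior an internal index should exhibit.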
A Tool for Model-Based Language Specification
Formal languages let us define the textual representation of data with
precision. Formal grammars, typically in the form of BNF-like productions,
describe the language syntax, which is then annotated for syntax-directed
translation and completed with semantic actions. When, apart from the textual
representation of data, an explicit representation of the corresponding data
structure is required, the language designer has to devise the mapping between
the suitable data model and its proper language specification, and then develop
the conversion procedure from the parse tree to the data model instance.
Unfortunately, whenever the format of the textual representation has to be
modified, changes have to be propagated throughout the entire language processor
tool chain. These updates are time-consuming, tedious, and error-prone.
Besides, in case different applications use the same language, several copies
of the same language specification have to be maintained. In this paper, we
introduce a model-based parser generator that decouples language specification
from language processing, hence avoiding many of the problems caused by
grammar-driven parsers and parser generators.
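The core idea of decoupling specification from processing can be sketched in a few lines (this is an invented illustration, not the tool's actual API): the data model itself serves as the language specification, and a parser is derived from the model by reflection, so modifying the model updates the processor automatically instead of requiring changes across a tool chain.

```python
from dataclasses import dataclass, fields

@dataclass
class Point:                # the data model doubles as the language spec
    x: int
    y: int

def derive_parser(model):
    """Derive a parser for 'name value value ...' records from the model's
    declared fields -- the mapping step the abstract describes by hand."""
    names = [f.name for f in fields(model)]
    def parse(text):
        tokens = text.split()
        assert tokens[0] == model.__name__.lower()
        # all fields are int in this toy model
        return model(**{n: int(t) for n, t in zip(names, tokens[1:])})
    return parse

parse_point = derive_parser(Point)
p = parse_point("point 3 4")
```

Adding a field to `Point` would change both the accepted input format and the produced data structure with no edits to the parsing code, which is the maintenance benefit the abstract argues for.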
Scanning and Parsing Languages with Ambiguities and Constraints: The Lamb and Fence Algorithms
Traditional language processing tools constrain language designers to
specific kinds of grammars. In contrast, model-based language processing tools
decouple language design from language processing. These tools allow the
occurrence of lexical and syntactic ambiguities in language specifications and
the declarative specification of constraints for resolving them. As a result,
these techniques require scanners and parsers able to parse context-free
grammars, handle ambiguities, and enforce constraints for disambiguation. In
this paper, we present Lamb and Fence. Lamb is a scanning algorithm that
supports ambiguous token definitions and the specification of custom pattern
matchers and constraints. Fence is a chart parsing algorithm that supports
ambiguous context-free grammars and the definition of constraints on
associativity, composition, and precedence, as well as custom constraints. Lamb
and Fence, in conjunction, enable the implementation of the ModelCC model-based
language processing tool.Comment: arXiv admin note: text overlap with arXiv:1111.3970, arXiv:1110.147
A DSL for Mapping Abstract Syntax Models to Concrete Syntax Models in ModelCC
ModelCC is a model-based parser generator that decouples language design from
language processing. ModelCC provides two different mechanisms to specify the
mapping from an abstract syntax model to a concrete syntax model: metadata
annotations defined on top of the abstract syntax model specification and a
domain-specific language for defining ASM-CSM mappings. Using a domain-specific
language to specify the mapping from abstract to concrete syntax models allows
the definition of multiple concrete syntax models for the same abstract syntax
model. In this paper, we describe the ModelCC domain-specific language for
abstract syntax model to concrete syntax model mappings and we showcase its
capabilities by providing a meta-definition of that domain-specific language.
Comment: arXiv admin note: substantial text overlap with arXiv:1202.659
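The payoff of separating the two models can be sketched with an invented mini-mapping (this is not ModelCC's DSL): one abstract syntax model paired with two interchangeable concrete syntax mappings, so the same data structure has two textual representations.

```python
from dataclasses import dataclass

@dataclass
class Add:                  # the abstract syntax model
    left: int
    right: int

# Two ASM-CSM mappings for the same model: a template plus a reader.
INFIX = "{left} + {right}"
PREFIX = "(+ {left} {right})"

def render(node, template):
    return template.format(left=node.left, right=node.right)

def parse_infix(text):
    l, r = text.split(" + ")
    return Add(int(l), int(r))

expr = Add(1, 2)
infix_text = render(expr, INFIX)      # "1 + 2"
prefix_text = render(expr, PREFIX)    # "(+ 1 2)"
```

Either concrete syntax can be swapped in without touching the abstract model or the applications built on it.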
The ModelCC Model-Based Parser Generator
Formal languages let us define the textual representation of data with
precision. Formal grammars, typically in the form of BNF-like productions,
describe the language syntax, which is then annotated for syntax-directed
translation and completed with semantic actions. When, apart from the textual
representation of data, an explicit representation of the corresponding data
structure is required, the language designer has to devise the mapping between
the suitable data model and its proper language specification, and then develop
the conversion procedure from the parse tree to the data model instance.
Unfortunately, whenever the format of the textual representation has to be
modified, changes have to be propagated throughout the entire language processor
tool chain. These updates are time-consuming, tedious, and error-prone.
Besides, in case different applications use the same language, several copies
of the same language specification have to be maintained. In this paper, we
introduce ModelCC, a model-based parser generator that decouples language
specification from language processing, hence avoiding many of the problems
caused by grammar-driven parsers and parser generators. ModelCC incorporates
reference resolution within the parsing process. Therefore, instead of
returning mere abstract syntax trees, ModelCC is able to obtain abstract syntax
graphs from input strings.
Comment: arXiv admin note: substantial text overlap with arXiv:1111.3970, arXiv:1501.0203
A Model-Driven Probabilistic Parser Generator
Existing probabilistic scanners and parsers impose hard constraints on the
way lexical and syntactic ambiguities can be resolved. Furthermore, traditional
grammar-based parsing tools are limited in the mechanisms they allow for taking
context into account. In this paper, we propose a model-driven tool that allows
for statistical language models with arbitrary probability estimators. Our work
on model-driven probabilistic parsing is built on top of ModelCC, a model-based
parser generator, and enables the probabilistic interpretation and resolution
of anaphoric, cataphoric, and recursive references in the disambiguation of
abstract syntax graphs. In order to demonstrate the expressive power of ModelCC,
we describe the design of a general-purpose natural language parser.
A Lexical Analysis Tool with Ambiguity Support
Lexical ambiguities naturally arise in languages. We present Lamb, a lexical
analyzer that produces a lexical analysis graph describing all the possible
sequences of tokens that can be found within the input string. Parsers can
process such lexical analysis graphs and discard any sequence of tokens that
does not produce a valid syntactic sentence, therefore performing, together
with Lamb, a context-sensitive lexical analysis of lexically ambiguous language
specifications.
Treating Insomnia, Amnesia, and Acalculia in Regular Expression Matching
Regular expressions provide a flexible means for matching strings and they
are often used in data-intensive applications. They are formally equivalent to
either deterministic finite automata (DFAs) or nondeterministic finite automata
(NFAs). Both DFAs and NFAs are affected by two problems known as amnesia and
acalculia, and DFAs are also affected by a problem known as insomnia. Existing
techniques require an automata conversion and compaction step that prevents the
use of existing automaton databases and hinders the maintenance of the
resulting compact automata. In this paper, we propose Parallel Finite State
Machines (PFSMs), which are able to run any DFA- or NFA-like state machines
without a previous conversion or compaction step. PFSMs report, online, all the
matches found within an input string and they solve the three aforementioned
problems. PFSMs require quadratic time and linear memory, and they are
distributable, making very fast distributed regular expression matching in
data-intensive applications feasible.
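The flavor of running several state machines over one input pass, reporting every match online, can be sketched for the simplest case of literal patterns (a deliberate simplification; PFSMs also handle full DFA/NFA transition structures and the amnesia, acalculia, and insomnia problems):

```python
def find_all(patterns, text):
    """One left-to-right pass over `text`; each (pattern, start) match is
    reported as soon as its last character is read."""
    matches = []
    active = []                      # (pattern_index, chars_matched, start)
    for pos, ch in enumerate(text):
        # start a fresh attempt for every pattern at this position
        active.extend((i, 0, pos) for i in range(len(patterns)))
        survivors = []
        for i, j, start in active:
            if patterns[i][j] == ch:         # this machine advances
                if j + 1 == len(patterns[i]):
                    matches.append((patterns[i], start))
                else:
                    survivors.append((i, j + 1, start))
        active = survivors                   # mismatching machines die
    return matches

hits = find_all(["ab", "b"], "abab")
```

All machines share a single scan of the input and report overlapping matches as they complete, with no prior conversion or compaction of the pattern set.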
A Model-Driven Parser Generator, from Abstract Syntax Trees to Abstract Syntax Graphs
Model-based parser generators decouple language specification from language
processing. The model-driven approach avoids the limitations that conventional
parser generators impose on the language designer. Conventional tools require
the designed language grammar to conform to the specific kind of grammar
supported by the particular parser generator (LL and LR parser generators
being the most common). Model-driven parser generators, like ModelCC, do not require
a grammar specification, since that grammar can be automatically derived from
the language model and, if needed, adapted to conform to the requirements of
the given kind of parser, all of this without interfering with the conceptual
design of the language and its associated applications. Moreover, model-driven
tools such as ModelCC are able to automatically resolve references between
language elements, hence producing abstract syntax graphs instead of abstract
syntax trees as the result of the parsing process. Such graphs are not confined
to directed acyclic graphs and they can contain cycles, since ModelCC supports
anaphoric, cataphoric, and recursive references.
An Automorphic Distance Metric and its Application to Node Embedding for Role Mining
Role is a fundamental concept in the analysis of the behavior and function of
interacting entities represented by network data. Role discovery is the task of
uncovering hidden roles. Node roles are commonly defined in terms of
equivalence classes, where two nodes have the same role if they fall within the
same equivalence class. Automorphic equivalence, where two nodes are equivalent
when they can swap their labels to form an isomorphic graph, captures this
common notion of role. However, the binary concept of equivalence is too
restrictive, as nodes in real-world networks rarely belong to the same
equivalence class.
Instead, a relaxed definition in terms of similarity or distance is commonly
used to compute the degree to which two nodes are equivalent. In this paper, we
propose a novel distance metric called automorphic distance, which measures how
far two nodes are from being automorphically equivalent. We also study its
application to node embedding, showing how our metric can be used to generate
vector representations of nodes preserving their roles for data visualization
and machine learning. Our experiments confirm that the proposed metric
outperforms the RoleSim automorphic equivalence-based metric in the generation
of node embeddings for different networks.
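The exact equivalence the paper relaxes can be checked by brute force on tiny graphs (exponential, which is precisely why a relaxed distance is needed in practice): two nodes are automorphically equivalent when some label permutation maps one to the other while preserving all edges.

```python
from itertools import permutations

def automorphically_equivalent(edges, nodes, u, v):
    """True if some automorphism of the undirected graph maps u to v."""
    edge_set = {frozenset(e) for e in edges}
    for perm in permutations(nodes):
        mapping = dict(zip(nodes, perm))
        if mapping[u] != v:
            continue                 # this permutation does not send u to v
        mapped = {frozenset((mapping[a], mapping[b])) for a, b in edges}
        if mapped == edge_set:       # edges are preserved: an automorphism
            return True
    return False

# Path graph 1-2-3: the endpoints can swap labels, endpoint and center cannot.
nodes = [1, 2, 3]
edges = [(1, 2), (2, 3)]
```

On the path graph, nodes 1 and 3 play the same role while 1 and 2 do not; the proposed automorphic distance grades such relationships continuously instead of answering only yes or no.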